Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs
Abstract
Large amounts of data are stored in relational DBMSs. However, statistical analysis is frequently performed outside the DBMS with statistical tools such as the well-known R package, which forces data through a file export bottleneck and leads to slow processing when data sets cannot fit in main memory. In this article, we propose algorithms for principal component analysis (PCA) and stochastic search variable selection (SSVS) on large data sets that work entirely inside a DBMS, using SQL queries and User-Defined Functions (UDFs). Both algorithms consist of two main phases: a first phase that computes sufficient statistics in one pass with SQL queries, and a second phase that derives the model from those sufficient statistics in main memory with UDFs. PCA is solved efficiently with SVD via UDFs in main memory once the sufficient statistics have been derived. The traditional SSVS algorithm, on the other hand, requires multiple passes over the data to compute a model. In contrast, our improved Bayesian algorithm performs a single table scan on the input data set, after which the UDF performs thousands of iterations on small matrices. In addition, we incorporate optimizations that exploit DBMS multi-threaded processing capabilities to compute multidimensional aggregates in data summarization. Specifically, we present low-level optimizations that distribute the workload among multiple cores, access records by block, and cache data in main memory. Experiments with large data sets demonstrate the efficiency of our optimizations for computing sufficient statistics and show that our algorithms scale linearly with the size of the data set. Finally, a detailed comparison against R, the standard open-source package for statistical research, shows the correctness and superior speed of our DBMS-based algorithms when processing very large data sets.
∗This work was partially supported by NSF grants CCF 0937562 and IIS 0914861.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD-LDMTA’10, July 25, 2010, Washington, DC, USA. Copyright 2010 ACM 978-1-4503-0215-9/10/07 ...$10.00.
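The two-phase scheme in the abstract can be illustrated outside a DBMS. The sketch below is not the authors' SQL/UDF implementation: it assumes the sufficient statistics are the row count n, the linear sum L, and the sum of outer products Q, and it assumes PCA is taken on the covariance matrix. The function names are hypothetical, and NumPy stands in for both the SQL aggregation pass and the in-memory UDF.

```python
import numpy as np


def sufficient_statistics(blocks):
    """Phase 1 (stand-in for the one-pass SQL aggregation).

    Scans the data once, block by block, and accumulates
    n (row count), L (sum of rows), and Q (sum of x x^T).
    """
    n, L, Q = 0, None, None
    for X in blocks:
        X = np.asarray(X, dtype=float)
        if L is None:
            d = X.shape[1]
            L, Q = np.zeros(d), np.zeros((d, d))
        n += X.shape[0]
        L += X.sum(axis=0)
        Q += X.T @ X
    return n, L, Q


def pca_from_statistics(n, L, Q, k):
    """Phase 2 (stand-in for the in-memory UDF).

    Works only on the small d x d summary, never on the raw rows.
    Assumption: PCA on the population covariance matrix.
    """
    mean = L / n
    cov = Q / n - np.outer(mean, mean)
    # cov is symmetric positive semidefinite, so its SVD yields the
    # principal directions (U) and the explained variances (s).
    U, s, _ = np.linalg.svd(cov)
    return U[:, :k], s[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 5))
    # Feeding the rows in blocks mimics block-wise record access.
    n, L, Q = sufficient_statistics(np.array_split(data, 10))
    components, variances = pca_from_statistics(n, L, Q, k=2)
    print(components.shape, variances)
```

Because phase 2 touches only a d x d matrix, its cost does not depend on the number of rows, which is consistent with the single-scan design described above. The same n, L, Q summary could in principle feed other models, but that is not shown here.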
Related Resources
Spatial Design for Knot Selection in Knot-Based Low-Rank Models
Analysis of large geostatistical data sets usually entails expensive matrix computations. This problem creates challenges in implementing statistical inference for traditional Bayesian models. In addition, researchers often face multiple spatial data sets with complex spatial dependence structures that are difficult to analyze. This is a problem for MCMC sampling algorith...
varbvs: Fast Variable Selection for Large-scale Regression
We introduce varbvs, a suite of functions written in R and MATLAB for regression analysis of large-scale data sets using Bayesian variable selection methods. We have developed numerical optimization algorithms based on variational approximation methods that make it feasible to apply Bayesian variable selection to very large data sets. With a focus on examples from genome-wide association studie...
The Gamma Operator for Big Data Summarization on an Array DBMS
SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable sel...
Selection of Variables that Influence Drug Injection in Prison: Comparison of Methods with Multiple Imputed Data Sets
Background: Prisoners, compared to the general population, are at greater risk of infection. Drug injection is the main route of HIV transmission, in particular in Iran. What would be of interest is to determine variables that govern drug injection among prisoners. However, one of the issues that challenge model building is incomplete national data sets. In this paper, we addressed the process ...
Bayesian regression based on principal components for high-dimensional data
Motivated by a climate prediction problem, we consider high dimensional Bayesian regression where the number of covariates is much larger than the number of observations. To reduce the dimension of the covariate, the response is regressed on the principal components obtained from the covariates, and it is argued that the PCA regression is equivalent to the original model in terms of prediction....